SINICA CORPUS: Design Methodology for Balanced Corpora

Authors

  • Keh-Jiann Chen
  • Chu-Ren Huang
  • Li-Ping Chang
  • Hui-Li Hsu
Abstract

The Academia Sinica Balanced Corpus (Sinica Corpus) is the first balanced Chinese corpus with part-of-speech tagging. The corpus (Sinica 2.0) is open to the research community through the WWW (http://www.sinica.edu.tw/ftms-bin/kiwi.sh). The current size of the corpus is 3.5 million words, and the immediate expansion target is five million words. Each text in the corpus is classified and marked according to five criteria: genre, style, mode, topic, and source. The feature values of these classifications are assigned in a hierarchy. Subcorpora can be defined with a specific set of attributes to serve different research purposes. Texts in the corpus are segmented according to the word segmentation standard proposed by the ROC Computational Linguistic Society. Each segmented word is tagged with its part of speech. Linguistic patterns and language structures can be extracted from the tagged corpus via a corpus inspection program that provides KWIC searching, filtering, statistics, printing, and collocation functions.
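As a rough illustration of the two operations the abstract describes (defining a subcorpus from hierarchical text attributes, then running a KWIC search over segmented, tagged words), here is a minimal Python sketch. It is not the Sinica Corpus inspection program: the `Text` class, the helper names, the attribute values, and the part-of-speech labels are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass


@dataclass
class Text:
    # Hypothetical record: each attribute value is a path in a hierarchy,
    # e.g. ("news", "editorial"); tokens are (word, pos) pairs.
    attrs: dict
    tokens: list


def matches(value, query):
    """A query path matches any value at or below it in the hierarchy."""
    return value[:len(query)] == query


def subcorpus(texts, **wanted):
    """Keep the texts whose attributes satisfy every requested path."""
    return [
        t for t in texts
        if all(matches(t.attrs.get(k, ()), q) for k, q in wanted.items())
    ]


def kwic(texts, keyword, width=2):
    """Return (left context, keyword, right context) for each occurrence."""
    hits = []
    for t in texts:
        words = [w for w, _ in t.tokens]
        for i, w in enumerate(words):
            if w == keyword:
                left = words[max(0, i - width):i]
                right = words[i + 1:i + 1 + width]
                hits.append((left, w, right))
    return hits


# Illustrative data only; the tags are placeholders, not the Sinica tagset.
SAMPLE = [
    Text({"genre": ("news", "editorial"), "mode": ("written",)},
         [("語料庫", "N"), ("很", "ADV"), ("大", "V")]),
    Text({"genre": ("fiction",), "mode": ("written",)},
         [("我", "PRON"), ("喜歡", "V"), ("語料庫", "N")]),
]

news_only = subcorpus(SAMPLE, genre=("news",))
concordance = kwic(SAMPLE, "語料庫", width=1)
```

Representing each attribute value as a path lets a query such as `genre=("news",)` select every text filed anywhere under that branch, which is one simple way to model classifications assigned in a hierarchy.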

Similar resources

How Should A Large Corpus Be Built? - A Comparative Study Of Closure In Annotated Newspaper Corpora From Two Chinese Sources, Towards Building A Larger Representative Corpus Merged From Representative Sublanguage Collections

This study measures comparative lexical and syntactic closure rates in annotated Chinese newspaper corpora from the Academia Sinica Balanced Corpus and the University of Pennsylvania's Chinese Treebank. It then draws inferences as to how large such corpora need to be to serve as representative models of subject-matter-constrained language domains within the same genre. Future large corpora should be bui...

A Comparison between Microblog Corpus and Balanced Corpus from Linguistic and Sentimental Perspectives

While microblogging has gained popularity on the Internet, analyzing and processing short messages has become a challenging task in natural language processing. This paper analyzes the differences between Internet short messages (or “microtext”) and general articles by comparing the Plurk Corpus and the Sinica Balanced Corpus. Likelihood ratio and the tóngyìcícílín thesaurus are adopted t...

KOTONOHA and BCCWJ: Development of a Balanced Corpus of Contemporary Written Japanese

The National Institute for Japanese Language (NIJL) has launched a long-term language corpus development initiative aimed at developing a super-corpus called KOTONOHA, which consists of a multitude of independent corpora. Among the constituent corpora of KOTONOHA, the most urgently needed is a large-scale balanced corpus of present-day written Japanese. Construct...

Automatic Acquisition of Linguistic Knowledge: From Sinica Corpus to Gigaword Corpus

The raison d’être for a corpus, as it was first conceived by Francis and Kucera in 1963, was to provide a body of linguistic facts from which linguistic knowledge could be generalized [1]. The methods of acquisition have evolved as corpus size and technology have advanced in the past 40 years. Originally, corpus-based concordances helped linguists form generalizations. This was what Fillmo...

Aspects of speaking-face data corpus design methodology

This paper develops a methodology for the design of audio-video data corpora of the speaking face. Existing corpora are surveyed and the principles of data specification, data description and statistical representation are analysed both from an application-driven and from a scientifically motivated perspective. Furthermore, the possibility of “opportunistic” design of speaking-face data corpora ...


Publication date: 1996